An execution unit is provided for executing a first instruction which includes an opcode field, a first operand field, and a second operand
field. The execution unit includes a first input register for receiving a first operand specified by a value of the first operand field, and a
second input register for receiving a second operand specified by a value of the second operand field. The execution unit further includes
a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator unit is also coupled to
receive the first and second operand values from the first and second input registers, respectively. The execution unit further includes a
multiplexer which receives a plurality of inputs. These inputs include a
first constant value, a second constant value, and the values of the first and second operand. If the decoded opcode value received by 
the comparator indicates that the first instruction is either a compare or extreme value function, the comparator conveys one or more 
control signals to the multiplexer for the purpose of selecting an output of the multiplexer as the result of the first instruction. If the first 
instruction is one of a plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select 
between the first operand and second operand to determine the result of the first instruction. If the first instruction is one of a plurality of 
compare instructions, the one or more control signals conveyed by the comparator unit select between the first and second constant value 
to determine the result of the first instruction. In another embodiment, a similar execution unit is provided which handles vector operands. 
WO 99/19791 PCT/US98/12666
MICROPROCESSOR COMPRISING INSTRUCTIONS TO DETERMINE EXTREME VALUES



Field of the Invention

This invention relates to computer systems and microprocessors, and more particularly to a multimedia 
execution unit incorporated within a microprocessor for accommodating high-speed multimedia applications. 
The invention further relates to extreme value functions and vector processing implemented within 
microprocessor based systems. 

Description of the Related Art 

Microprocessors typically achieve increased performance by partitioning processing tasks into multiple 
pipeline stages. In this manner, microprocessors may independently be executing various portions of multiple 
instructions during a single clock cycle. As used herein, the term "clock cycle" refers to an interval of time 
during which the pipeline stages of a microprocessor perform their intended functions. At the end of the clock 
cycle, the resulting values are moved to the next pipeline stage. 

Microprocessor based computer systems have historically been used primarily for business applications, 
including word processing and spreadsheets, among others. Increasingly, however, computer systems have 
evolved toward the use of more real-time applications, including multimedia applications such as video and audio 
processing, video capture and playback, telephony and speech recognition. Since these multimedia applications 
are computationally intensive, various enhancements have been implemented within microprocessors to improve
multimedia performance. For example, some general purpose microprocessors have been enhanced with 
multimedia execution units configured to execute certain special instructions particularly tailored for multimedia 
computations. These instructions are often implemented as "vectored" instructions wherein operands for the 
instructions are partitioned into separate sections or vectors which are independently operated upon in accordance 
with the instruction definition. For example, a vectored add instruction may include a pair of 32-bit operands, 
each of which is partitioned into four 8-bit sections. Upon execution of such a vectored add instruction, 
corresponding 8-bit sections of each operand are independently and concurrently added to obtain four separate 
and independent addition results. Implementation of such vectored instructions in a computer system furthers the 
use of parallelism, and typically leads to increased performance for certain applications. 
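
The vectored add described above can be sketched in software as follows. This is a minimal illustration only: the function name is hypothetical, and wraparound (rather than saturating) arithmetic within each 8-bit section is an assumption, since the text does not specify overflow behavior.

```python
def vectored_add_8x4(a: int, b: int) -> int:
    """Add the four 8-bit sections of two 32-bit operands independently.

    Assumption: each section wraps around modulo 256 on overflow.
    """
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_a = (a >> shift) & 0xFF
        lane_b = (b >> shift) & 0xFF
        # The carry out of a section is discarded, so it never
        # propagates into the neighboring section.
        result |= ((lane_a + lane_b) & 0xFF) << shift
    return result
```

For instance, adding 0x01010101 to 0x01FF0310 under these assumptions leaves the section holding 0xFF at 0x00 without disturbing the section above it.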

One type of commonly employed function in multimedia applications is a compare function. A compare 
function is typically implemented through the execution of a compare instruction which compares the value of one
operand against another to determine whether the value of the first is greater than, equal to, or less than the other. 
A compare instruction may be treated as a vectored instruction wherein corresponding sections of associated 
operands are compared independently of other sections of the operands. 

Another set of functions commonly utilized in multimedia processing are the extreme value functions. 
As used herein, "extreme value functions" are those functions which return either a minimum value selected 
among a plurality of values, or a maximum value selected among a plurality of values as a result of the function. 
In typical multimedia systems, a minimum value or a maximum value is obtained through the execution of 
several sequentially executed instructions. For example, a compare instruction may first be executed to determine 
the relative magnitudes of a pair of operand values, and subsequently a conditional branch instruction may be
executed to determine whether a move operation must be performed to move the extreme value to a destination 
register or other storage location. These sequences of commands are common in multimedia applications, such as 
clipping algorithms in graphics rendering systems. Since extreme value functions are implemented through the 
execution of multiple instructions, however, a relatively large amount of processing time may be consumed by
such operations. 
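
The multi-instruction sequence described above may be sketched as follows. The function name and the step counter are hypothetical; the counter simply tallies the instructions in the sequence and is not a cycle-accurate timing model.

```python
def minimum_via_compare_and_branch(dest, src):
    """Model a minimum computed as compare, conditional branch, and move.

    Returns the resulting destination value and the number of
    instructions executed in this illustrative sequence.
    """
    steps = 0

    # Step 1: compare instruction determines the relative magnitudes.
    dest_is_greater = dest > src
    steps += 1

    # Step 2: conditional branch decides whether a move is required.
    steps += 1

    # Step 3: the move is performed only when the branch is not taken
    # around it, i.e. when the source holds the extreme value.
    if dest_is_greater:
        dest = src
        steps += 1

    return dest, steps
```

Under this sketch the minimum costs two or three instructions, which a single extreme value instruction would replace with one.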

It would therefore be desirable to provide a multimedia execution unit in a microprocessor which is 
capable of obtaining an extreme value through the execution of a single instruction. It would further be desirable 
to provide a multimedia execution unit with an efficient hardware implementation of the extreme value 
instructions.

Summary of the Invention 

The problems outlined above are in large part solved by an execution unit in accordance with the present 
invention. In one embodiment, an execution unit is provided for executing a first instruction which includes an 
opcode field, a first operand field, and a second operand field. The execution unit includes a first input register
for receiving a first operand specified by a value of the first operand field, and a second input register for
receiving a second operand specified by a value of the second operand field. The execution unit further includes
a comparator unit which is coupled to receive a value of the opcode field for the first instruction. The comparator
unit is also coupled to receive the first and second operand values from the first and second input registers,
respectively. The execution unit further includes a multiplexer which receives a plurality of inputs. These inputs
include a first constant value, a second constant value, and the values of the first and second operands. If the
decoded opcode value received by the comparator indicates that the first instruction is either a compare or
extreme value function, the comparator conveys one or more control signals to the multiplexer for the purpose of
selecting an output of the multiplexer as the result of the first instruction. If the first instruction is one of a
plurality of extreme value instructions, the one or more control signals conveyed by the comparator unit select
between the first operand and second operand to determine the result of the first instruction. If the first
instruction is one of a plurality of compare instructions, the one or more control signals conveyed by the
comparator unit select between the first and second constant value to determine the result of the first instruction.
In the case of the compare instructions, the value of the first and second constants may be advantageously chosen
in order to form a mask for use by subsequent instructions. In another embodiment, a similar execution unit is
provided which handles vector operands.
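
The shared comparator and multiplexer arrangement summarized above may be modeled in software as follows. This is a sketch under stated assumptions: the opcode names, the select encoding, and the choice of all ones and all zeros as the two constant values are hypothetical and are used only to illustrate how one comparator can serve both instruction classes.

```python
# Assumed constant values, chosen so that compare results form a mask.
FIRST_CONSTANT = 0xFFFFFFFF   # condition true (assumption)
SECOND_CONSTANT = 0x00000000  # condition false (assumption)

# Hypothetical encoding of the control signals sent to the multiplexer.
SEL_CONST1, SEL_CONST2, SEL_OP1, SEL_OP2 = range(4)


def comparator(opcode, op1, op2):
    """Generate the multiplexer select from the decoded opcode and operands."""
    if opcode == "MIN":
        return SEL_OP1 if op1 < op2 else SEL_OP2
    if opcode == "MAX":
        return SEL_OP1 if op1 > op2 else SEL_OP2
    if opcode == "CMPEQ":
        return SEL_CONST1 if op1 == op2 else SEL_CONST2
    if opcode == "CMPGT":
        return SEL_CONST1 if op1 > op2 else SEL_CONST2
    if opcode == "CMPGE":
        return SEL_CONST1 if op1 >= op2 else SEL_CONST2
    raise ValueError(f"not a compare or extreme value opcode: {opcode}")


def multiplexer(select, const1, const2, op1, op2):
    """Convey exactly one of the four inputs as the instruction result."""
    return (const1, const2, op1, op2)[select]


def execute(opcode, op1, op2):
    """Model one pass through the execution unit."""
    select = comparator(opcode, op1, op2)
    return multiplexer(select, FIRST_CONSTANT, SECOND_CONSTANT, op1, op2)
```

In this model the extreme value opcodes only ever select an operand input and the compare opcodes only ever select a constant input, which reflects the hardware sharing described in the summary.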

The extreme value functions are thus implemented in a single instruction. This advantageously results in 
improved performance for these instructions, which are particularly important in multimedia applications. An
efficient hardware implementation is also achieved as the circuitry used for the extreme value operations is also 
shared by a plurality of compare operations. 

Broadly speaking, the present invention contemplates a microprocessor configured to execute a first
instruction, wherein an encoded representation of said first instruction includes an opcode field, a first operand 
field, and a second operand field. The microprocessor comprises an execution unit coupled to receive a decoded 
value of the opcode field, a first operand specified by a value of the first operand field, and a second operand 
specified by a value of the second operand field, wherein the execution unit is configured to perform an extreme
value operation on the first operand and the second operand in response to receiving the decoded value of the 
opcode field. The execution unit is further configured to convey an output value of the extreme value operation 
as a result of the first instruction.

The present invention further contemplates an execution unit in a microprocessor for executing a first
instruction, wherein an encoded representation of the first instruction includes an opcode field, a first operand
field, and a second operand field. The execution unit comprises a first input register coupled to receive a first
operand specified by a value of the first operand field and a second input register coupled to receive a second
operand specified by a value of the second operand field. The execution unit further comprises a comparator unit
coupled to receive the first operand from the first input register and the second operand from the second input
register. The comparator unit is coupled to receive a decoded opcode value of the opcode field on a decoded
opcode bus. Still further, the execution unit comprises a multiplexer coupled to receive a plurality of inputs
including the first operand from the first input register, the second operand from the second input register, a first
constant value, and a second constant value. The multiplexer is configured to select one of the plurality of inputs
to be conveyed as a result of the first instruction in response to receiving one or more control signals from the
comparator unit. The comparator unit is configured to generate the one or more control signals in response to
receiving the decoded opcode value, and, if the decoded opcode value indicates that the first instruction is one of
a plurality of extreme value instructions, the one or more control signals are usable to select either the first
operand or the second operand as the output value. If the decoded opcode value indicates that the first instruction
is one of a plurality of compare instructions, the one or more control signals are usable to select either the first
constant value or the second constant value as the output value.

The present invention still further contemplates an execution unit in a microprocessor for executing a 
first instruction, wherein an encoded representation of the first instruction includes an opcode field, a first 
operand field, and a second operand field. The execution unit comprises a first input register coupled to receive a
first operand specified by a value of the first operand field, wherein the first operand includes a first vector value
followed by a second vector value, as well as a second input register coupled to receive a second operand
specified by a value of the second operand field, wherein the second operand includes a third vector value
followed by a fourth vector value. The execution unit further comprises a comparator unit coupled to receive the
first operand from the first input register and the second operand from the second input register. The comparator
unit is coupled to receive a decoded opcode value of the opcode field on a decoded opcode bus. Additionally, the
execution unit comprises a first multiplexer coupled to receive a first plurality of inputs including the first vector
value from the first input register, the third vector value from the second input register, a first constant value, and
a second constant value. The first multiplexer is configured to select one of the first plurality of inputs to be
conveyed as a first portion of a vector result of the first instruction in response to receiving a first set of control
signal values from the comparator unit. The comparator unit is further configured to generate the first set of
control signal values in response to receiving the decoded opcode value. If the decoded opcode value indicates
that the first instruction is one of a plurality of extreme value instructions, the first set of control signal values are
usable to select either the first vector value or the third vector value as the first portion of the vector result. If the
decoded opcode value indicates that the first instruction is one of a plurality of compare instructions, the first set
of control signal values are usable to select either the first constant value or the second constant value as the first
portion of the vector result.
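
The vector arrangement described above may be sketched as follows. The opcode names and mask constants are hypothetical, and the handling of the second portion of the result (pairing the second vector value with the fourth) is an assumption made by analogy with the first multiplexer, since only the first portion is described at this point.

```python
MASK_TRUE = 0xFFFFFFFF   # assumed first constant value
MASK_FALSE = 0x00000000  # assumed second constant value


def select_portion(opcode, a, b):
    """Comparator decision plus multiplexer selection for one vector portion."""
    if opcode == "MIN":
        return a if a < b else b
    if opcode == "MAX":
        return a if a > b else b
    if opcode == "CMPEQ":
        return MASK_TRUE if a == b else MASK_FALSE
    if opcode == "CMPGT":
        return MASK_TRUE if a > b else MASK_FALSE
    if opcode == "CMPGE":
        return MASK_TRUE if a >= b else MASK_FALSE
    raise ValueError(f"not a compare or extreme value opcode: {opcode}")


def execute_vector(opcode, first_operand, second_operand):
    """Model a two-element vector operation.

    first_operand  = (first vector value, second vector value)
    second_operand = (third vector value, fourth vector value)
    """
    (v1, v2), (v3, v4) = first_operand, second_operand
    first_portion = select_portion(opcode, v1, v3)
    # Assumption: a second multiplexer treats v2 and v4 the same way.
    second_portion = select_portion(opcode, v2, v4)
    return (first_portion, second_portion)
```

Each portion is decided independently, so a single vector instruction can return an operand value from one source in the first portion and from the other source in the second.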

The present invention additionally contemplates a method for executing a first instruction within an 
execution unit of a microprocessor, wherein an encoded representation of the first instruction includes an opcode
field, a first operand field, and a second operand field. The method comprises conveying a first plurality of inputs
to a comparator unit within the execution unit, wherein the first plurality of inputs includes a first operand
specified by a value of the first operand field, a second operand specified by a value of the second operand field,
and a decoded opcode value which corresponds to an encoded opcode value of the opcode field. The method
further comprises conveying a second plurality of inputs to a multiplexer within the execution unit, wherein the
second plurality of inputs includes the first operand, the second operand, a first constant value, and a second
constant value. Still further, the method comprises generating a set of control signal values from the comparator
unit in response to receiving the first plurality of inputs. The method next comprises conveying one of the second
plurality of inputs from the multiplexer as a result of the first instruction in response to receiving the set of control
signal values. The result is selected from the first operand and the second operand according to the set of control
signal values if the decoded opcode value indicates that the first instruction corresponds to one of a plurality of
extreme value instructions. The result is selected from the first constant value and the second constant value
according to the set of control signal values if the decoded opcode value indicates that the first instruction
corresponds to one of a plurality of compare instructions.

Finally, the present invention contemplates a microprocessor configured to execute a first instruction.
The microprocessor comprises an instruction cache configured to store an encoded representation of the first
instruction, wherein the encoded representation includes an opcode field, a first operand field, and a second
operand field. The microprocessor further comprises a decode unit coupled to receive the encoded representation
of the first instruction from the instruction cache, wherein the decode unit is configured to generate a decoded
opcode value in response to receiving a value of the opcode field. The microprocessor still further comprises an
execution unit coupled to the decode unit, wherein the decode unit is further configured to cause a first operand
and a second operand to be conveyed to the execution unit. The first operand is specified by a value of the first
operand field, while the second operand is specified by a value of the second operand field.

The execution unit includes a first input register coupled to receive a first operand specified by a value of
the first operand field, as well as a second input register coupled to receive a second operand specified by a value
of the second operand field. The execution unit further includes a comparator unit coupled to receive the first
operand from the first input register and the second operand from the second input register. The comparator unit
is further coupled to receive a decoded opcode value of the opcode field on a decoded opcode bus. The execution
unit still further includes a multiplexer coupled to receive a plurality of inputs including the first operand from the
first input register, the second operand from the second input register, a first constant value, and a second constant
value. The multiplexer is configured to select one of the plurality of inputs to be conveyed as a result of the first
instruction in response to receiving a control signal value from the comparator unit. The comparator unit is
configured to generate the control signal value in response to receiving the decoded opcode value. If the decoded
opcode value indicates that the first instruction is one of a plurality of extreme value instructions, the control
signal value is usable to select either the first operand or the second operand as the output value. If the decoded
opcode value indicates that the first instruction is one of a plurality of compare instructions, the control signal is
usable to select either the first constant value or the second constant value as the output value.

Brief Description of the Drawings

Other objects and advantages of the invention will become apparent upon reading the following detailed 
description and upon reference to the accompanying drawings in which: 

Fig. 1 is a block diagram of a microprocessor;

Figs. 2A-C illustrate the format and operation of a minimum value instruction according to one
embodiment of the invention;

Figs. 3A-C illustrate the format and operation of a maximum value instruction according to one
embodiment of the invention;

Figs. 4A-C illustrate the format and operation of an equality compare instruction according to one
embodiment of the invention;

Figs. 5A-C illustrate the format and operation of a greater than compare instruction according to one
embodiment of the invention;

Figs. 6A-C illustrate the format of a greater than or equal to compare instruction according to one
embodiment of the invention;

Figs. 7A-B illustrate the format of the operands utilized by the instructions depicted in Figs. 2-6
according to one embodiment of the invention;

Fig. 8 is a block diagram of an execution unit configured to execute the instructions depicted in Figs. 2-6
according to one embodiment of the invention; and

Fig. 9 is a block diagram of a vector execution unit configured to execute the instructions depicted in
Figs. 2-6 according to one embodiment of the invention.

Detailed Description of the Invention

Turning now to Fig. 1, a block diagram of one embodiment of a microprocessor 10 is shown. As 
depicted, microprocessor 10 includes a predecode logic block 12 coupled to an instruction cache 14 (which 
includes an instruction TLB 16). A cache controller 18 is coupled to both predecode block 12 and instruction
cache 14, as well as a bus interface unit 24 and a data cache 26 (which includes a data TLB 28). Microprocessor
10 further includes a decode unit 20, which receives instructions from instruction cache 14 which are forwarded 
to execution engine 30 in accordance with input received from a branch logic unit 22. 

Execution engine 30 includes a scheduler buffer 32 coupled to receive input from decode unit 20. 
Scheduler buffer 32 is coupled to convey decoded instructions to a plurality of execution units 36A-G in
accordance with input received from an instruction control unit 34. Execution units 36A-G include a load unit
36A, a store unit 36B, an integer X unit 36C, a multimedia unit 36D, an integer Y unit 36E, a floating point unit 
36F, and a branch resolving unit 36G. Load unit 36A receives input from data cache 26, while store unit 36B 
interfaces with data cache 26 via a store queue 38. Blocks referred to herein with a reference number followed by 
a letter will be collectively referred to by the reference number alone. For example, execution units 36A-G will
be collectively referred to as execution units 36. 

Generally speaking, multimedia execution unit 36D within microprocessor 10 is configured to provide 
an efficient implementation for extreme value instructions. As will be described in greater detail below, 
execution unit 36D utilizes hardware which performs compare operations in order to also perform minimum and
maximum value instructions. In this manner, execution unit 36D advantageously implements these extreme value 
instructions as dedicated, single-cycle, instructions, thereby increasing the performance of applications such as 
three-dimensional graphics rendering and audio processing. 

In one embodiment, instruction cache 14 is organized as sectors, with each sector including two 32-byte
cache lines. The two cache lines of a sector share a common tag but have separate state bits that track the status
of the line. Accordingly, two forms of cache misses (and associated cache fills) may take place: sector 
replacement and cache line replacement. In the case of sector replacement, the miss is due to a tag mismatch in 
instruction cache 14, with the required cache line being supplied by external memory via bus interface unit 24. 
The cache line within the sector that is not needed is then marked invalid. In the case of a cache line replacement,
the tag matches the requested address, but the line is marked as invalid. The required cache line is supplied by
external memory, but, unlike the sector replacement case, the cache line within the sector that was not requested 
remains in the same state. In alternate embodiments, other organizations for instruction cache 14 may be utilized, 
as well as various replacement policies. 

Microprocessor 10 performs prefetching only in the case of sector replacements in one embodiment.
During sector replacement, the required cache line is filled. If this required cache line is in the first half of the
sector, the other cache line in the sector is prefetched. If this required cache line is in the second half of the 
sector, no prefetching is performed. It is noted that other prefetching methodologies may be employed in 
different embodiments of microprocessor 10. 
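
The sector replacement, cache line replacement, and prefetching behavior described above may be sketched for a single sector as follows. The class and function names are hypothetical, and the model tracks only a tag and one valid bit per line; it is a simplification of the state bits mentioned in the text, not a description of the actual cache controller.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sector:
    """One sector: a shared tag and a valid bit for each of its two lines."""
    tag: Optional[int] = None
    valid: List[bool] = field(default_factory=lambda: [False, False])


def access(sector: Sector, tag: int, line: int) -> str:
    """Classify an access to line 0 or 1 of the sector and update its state."""
    if sector.tag == tag:
        if sector.valid[line]:
            return "hit"
        # Tag matches but the line is invalid: fill only the requested
        # line and leave the other line in its current state.
        sector.valid[line] = True
        return "cache line replacement"

    # Tag mismatch: the whole sector is replaced. The requested line is
    # filled and the other line starts out invalid.
    sector.tag = tag
    sector.valid = [False, False]
    sector.valid[line] = True
    if line == 0:
        # Required line is in the first half, so the other line of the
        # sector is prefetched. No prefetch occurs for the second half.
        sector.valid[1] = True
    return "sector replacement"
```

A miss on the first line of a new sector therefore leaves both lines valid, while a miss on the second line leaves the first line invalid until it is separately requested.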

When cache lines of instruction data are retrieved from external memory by bus interface unit 24, this
data is conveyed to predecode logic block 12. In one embodiment, the instructions processed by microprocessor
10 and stored in cache 14 are variable-length (e.g., the x86 instruction set). Because decode of variable-length
instructions is particularly complex, predecode logic 12 is configured to provide additional information to be
stored in instruction cache 14 to aid during decode. In one embodiment, predecode logic 12 generates predecode
bits for each byte in instruction cache 14 which indicate the number of bytes to the start of the next
variable-length instruction. These predecode bits are passed to decode unit 20 when instruction bytes are requested from
cache 14.
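
The predecode information described above may be illustrated as follows. The function name is hypothetical, the instruction lengths are assumed to be already known, and the representation of each predecode value as a plain integer is an assumption; the actual bit encoding is not given in the text.

```python
from typing import List


def predecode(instruction_lengths: List[int]) -> List[int]:
    """Return, for each byte, the distance in bytes to the next instruction start.

    instruction_lengths lists the byte length of each consecutive
    variable-length instruction (assumed to be known for this sketch).
    """
    distances: List[int] = []
    for length in instruction_lengths:
        # Byte i of an instruction (counting from 0) is (length - i)
        # bytes away from the first byte of the following instruction.
        distances.extend(range(length, 0, -1))
    return distances
```

For three instructions of lengths 3, 1, and 2 bytes, this sketch yields the per-byte values 3, 2, 1, 1, 2, 1, so a decoder starting at any byte can locate the next instruction boundary without re-scanning.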

Instruction cache 14 is implemented as a 32Kbyte, two-way set associative, writeback cache in one 
embodiment of microprocessor 10. The cache line size is 32 bytes in this embodiment. Cache 14 also includes a 
64-entry TLB used to translate linear addresses to physical addresses. Many other variations of instruction cache 
14 are possible in other embodiments.

Instruction fetch addresses are supplied by cache controller 18 to instruction cache 14. In one 
embodiment, up to 16 bytes per clock cycle may be fetched from cache 14. The fetched information is placed 
into an instruction buffer that feeds into decode unit 20. In one embodiment of microprocessor 10, fetching may
occur along a single execution stream with seven outstanding branches taken. 

In one embodiment, the instruction fetch logic within cache controller 18 is capable of retrieving any 16 
contiguous instruction bytes within a 32-byte boundary of cache 14. There is no additional penalty when the 16 
bytes cross a cache line boundary. Instructions are loaded into the instruction buffer as the current instructions
are consumed by decode unit 20. Other configurations of cache controller 18 are possible in other embodiments.

Decode logic 20 is configured to decode multiple instructions per processor clock cycle. In one 
embodiment, decode unit 20 accepts instruction and predecode bytes from the instruction buffer (in x86 format), 
locates actual instruction boundaries, and generates corresponding "RISC ops". RISC ops are fixed-format 
internal instructions, most of which are executable by microprocessor 10 in a single clock cycle. RISC ops are
combined to form every function of the x86 instruction set in one embodiment of microprocessor 10. 

Microprocessor 10 uses a combination of decoders to convert x86 instructions into RISC ops. The 
hardware includes three sets of decoders: two parallel short decoders, one long decoder, and one vectoring 
decoder. The parallel short decoders translate the most commonly-used x86 instructions (moves, shifts, branches, 
etc.) into zero, one, or two RISC ops each. The short decoders only operate on x86 instructions that are up to
seven bytes long. In addition, they are configured to decode up to two x86 instructions per clock cycle. The
commonly-used x86 instructions which are greater than seven bytes long, as well as those semi-commonly-used
instructions that are up to seven bytes long, are handled by the long decoder.

The long decoder in decode unit 20 only performs one decode per clock cycle, and generates up to four 
RISC ops. All other translations (complex instructions, interrupts, etc.) are handled by a combination of the
vector decoder and RISC op sequences fetched from an on-chip ROM. For complex operations, the vector 
decoder logic provides the first set of RISC ops and an initial address to a sequence of further RISC ops. The 
RISC ops fetched from the on-chip ROM are of the same type that are generated by the hardware decoders. 
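
The division of work among the short, long, and vector decoders described above may be summarized in the following sketch. The function name and the usage categories ("common", "semi-common", and anything else) are hypothetical labels introduced for illustration; the text does not state how instructions are classified in hardware.

```python
def choose_decoder(usage: str, length_bytes: int) -> str:
    """Pick the decoder that would handle an x86 instruction in this model.

    usage is a hypothetical label: "common", "semi-common", or any other
    string for complex instructions, interrupts, and similar cases.
    """
    if usage == "common" and length_bytes <= 7:
        # Most commonly-used instructions of up to seven bytes.
        return "short"
    if usage == "common" or (usage == "semi-common" and length_bytes <= 7):
        # Common instructions longer than seven bytes, and semi-common
        # instructions of up to seven bytes.
        return "long"
    # All other translations use the vector decoder and on-chip ROM.
    return "vector"
```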

In one embodiment, decode unit 20 generates a group of four RISC ops each clock cycle. For clock 
cycles in which four RISC ops cannot be generated, decode unit 20 places RISC NOP operations in the remaining
slots of the grouping. These groupings of RISC ops (and possible NOPs) are then conveyed to scheduler buffer 
32. 
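
The grouping of RISC ops with NOP padding described above may be sketched as follows. The function name and the string representation of RISC ops are hypothetical.

```python
from typing import List


def form_group(risc_ops: List[str], width: int = 4) -> List[str]:
    """Pad the RISC ops generated in one clock cycle to a full group.

    Slots that cannot be filled with generated RISC ops receive NOPs.
    """
    if len(risc_ops) > width:
        raise ValueError("more RISC ops than slots in one group")
    return list(risc_ops) + ["NOP"] * (width - len(risc_ops))
```

A cycle that produces only two RISC ops thus still conveys a four-entry group to the scheduler buffer, with NOPs in the last two slots.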

It is noted that in another embodiment, an instruction format other than x86 may be stored in instruction 
cache 14 and subsequently decoded by decode unit 20. 

Instruction control logic 34 contains the logic necessary to manage out-of-order execution of instructions
stored in scheduler buffer 32. Instruction control logic 34 also manages data forwarding, register renaming, 
simultaneous issue and retirement of RISC ops, and speculative execution. In one embodiment, scheduler buffer 
32 holds up to 24 RISC ops at one time, equating to a maximum of 12 x86 instructions. When possible, 
instruction control logic 34 may simultaneously issue (from buffer 32) a RISC op to any available one of
execution units 36. In total, control logic 34 may issue up to six and retire up to four RISC ops per clock cycle in
one embodiment. 

In one embodiment, microprocessor 10 includes seven execution units (36A-G). Load unit 36A and
store unit 36B are two-stage pipelined designs. Store unit 36B performs data memory and register writes which
are available for loading after one clock cycle. Load unit 36A performs memory reads. The data from these
reads is available after two clock cycles. Load and store units are possible in other embodiments with varying
latencies.

Execution unit 36C (Integer X unit) is a fixed point execution unit which is configured to operate on all
ALU operations, as well as multiplies, divides (both signed and unsigned), shifts, and rotates. In contrast,
execution unit 36E (Integer Y unit) is a fixed point execution unit which is configured to operate on the basic 
word and doubleword ALU operations (ADD, AND, CMP, etc.). 

Execution unit 36D (multimedia unit) is configured to accelerate performance of software written using 
multimedia instructions. Applications that can take advantage of multimedia instructions include graphics, video
and audio compression and decompression, speech recognition, and telephony. Execution unit 36D is configured
to execute multimedia instructions in a single clock cycle in one embodiment. Many of these instructions are
designed to perform the same operation on multiple sets of data at once (vector processing). In one embodiment,
multimedia unit 36D uses registers which are mapped on to the stack of floating point unit 36F. 

Execution unit 36F contains an IEEE 754-compatible floating point unit designed to accelerate the
performance of software which utilizes the x86 instruction set. Floating point software is typically written to
manipulate numbers that are either very large or small, require a great deal of precision, or result from complex
mathematical operations such as transcendentals. Floating point unit 36F includes an adder unit, a multiplier unit, and
a divide/square root unit. In one embodiment, these low-latency units are configured to execute floating point 
instructions in as few as two clock cycles. 

Execution unit 36G (the branch resolving unit) is separate from branch prediction logic 22 in that it
resolves conditional branches such as JCC and LOOP after the branch condition has been evaluated. Branch 
resolving unit 36G allows efficient speculative execution, enabling microprocessor 10 to execute instructions 
beyond conditional branches before knowing whether the branch prediction was correct. As described above, 
microprocessor 10 is configured to handle up to seven outstanding branches in one embodiment. 

Branch prediction logic 22, coupled to decode unit 20, is configured to increase the accuracy with which
conditional branches are predicted in microprocessor 10. Ten to twenty percent of the instructions in typical 
applications include conditional branches. Branch prediction logic 22 is configured to handle this type of 
program behavior and its negative effects on instruction execution, such as stalls due to delayed instruction 
fetching. In one embodiment, branch prediction logic 22 includes an 8192-entry branch history table, a 16-entry
by 16-byte branch target cache, and a 16-entry return address stack.

Branch prediction logic 22 implements a two-level adaptive history algorithm using the branch history 
table. This table stores executed branch information, predicts individual branches, and predicts behavior of 
groups of branches. In one embodiment, the branch history table does not store predicted target addresses in 
order to save space. These addresses are instead calculated on-the-fly during the decode stage. 
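The two-level adaptive scheme described above can be sketched in software. This is a minimal illustration only, not the circuit of branch prediction logic 22: the global history register, the 2-bit saturating counters, the XOR index, and the 13-bit history length are all assumptions made for the example. Only the 8192-entry table size comes from the text.

```python
class TwoLevelPredictor:
    """Toy two-level adaptive branch predictor (assumed organization)."""

    def __init__(self, entries=8192, history_bits=13):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.history = 0                 # level 1: outcomes of recent branches
        self.counters = [1] * entries    # level 2: 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # Hypothetical hash: XOR the branch address with the history bits.
        return (pc ^ self.history) % self.entries

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

Because the history register takes part in the index, a branch is tracked per history pattern rather than per address alone, which is what allows such a table to capture the behavior of groups of branches.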

To avoid a clock cycle penalty for a cache fetch when a branch is predicted taken, a branch target cache 
within branch logic 22 supplies the first 16 bytes at that address directly to the instruction buffer (if a hit occurs in 
the branch target cache). In one embodiment, this branch prediction logic achieves branch prediction rates of 
over 95%. 

Branch logic 22 also includes special circuitry designed to optimize the CALL and RET instructions. 
This circuitry allows the address of the next instruction following the CALL instruction in memory to be pushed 
onto a return address stack. When microprocessor 10 encounters a RET instruction, branch logic 22 pops this 
address from the return stack and begins fetching. 
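The CALL/RET handling described above behaves like a small bounded stack. The sketch below assumes a 16-entry depth (the figure given earlier for one embodiment) and assumes that the oldest entry is discarded on overflow; the text does not state an overflow policy, and the class and method names are hypothetical.

```python
from collections import deque


class ReturnAddressStack:
    """Toy model of a return address stack (overflow policy is assumed)."""

    def __init__(self, depth=16):
        # A deque with maxlen drops its oldest element when a push overflows.
        self.entries = deque(maxlen=depth)

    def on_call(self, call_address, call_length):
        # Push the address of the instruction that follows the CALL in memory.
        self.entries.append(call_address + call_length)

    def on_ret(self):
        # Pop the predicted return address; None means no prediction is available.
        return self.entries.pop() if self.entries else None
```

Nested calls return in last-in, first-out order, so the most recent CALL always supplies the next predicted fetch address.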
Like instruction cache 14, data cache 26 is also organized as two-way set associative 32Kbyte storage. 

In one embodiment, data TLB 28 includes 128 entries used to translate linear to physical addresses. Like 
instruction cache 14, data cache 26 is also sectored. Data cache 26 implements a MESI (modified-exclusive- 
shared-invalid) protocol to track cache line status, although other variations are also possible. 

Turning now to Fig. 2A, the format of a floating point minimum value instruction ("PFMIN") 100 is 
shown according to one embodiment of the invention. As depicted, PFMIN instruction 100 includes an opcode 
value 101 and two operands, first operand field 102A and second operand field 102B. The value specified by first 
operand field 102A is shown as being "mmreg1", which, in one embodiment, is one of the registers on the stack 
of floating point execution unit 36F. The value specified by second operand field 102B is shown as either being 
another of the floating point stack registers or a memory location. In another embodiment, second operand field 
102B specifies an immediate value. 

In one embodiment, instruction 100 (and other instructions to be described below with reference to Figs. 
3A, 4A, 5A, and 6A) specifies operands (such as the values specified by operand fields 102) having more than 
one independent value within a given register which is specified as an operand. That is, registers such as mmreg1 
specified in Fig. 2A are vector registers. 

The format of such a register 600 is shown in Fig. 7A. Register 600 includes two separate vector 
quantities, first vector value 602A and second vector value 602B. In one embodiment, all of the floating point 
registers in execution unit 36F which are accessible by instruction 100 and other instructions described herein are 
organized in a similar manner. Vector values 602 each include a 32-bit single-precision floating point value in one 
embodiment. In other embodiments, vector values 602 may be stored in other numerical representations, such as 
a fixed point format. 

The format of the single-precision values stored in vector values 602 is depicted in Fig. 7B. As shown, 
format 610 (which corresponds to IEEE floating point format) includes a sign bit 612 (S), an exponent value 614 
(E), and a significand value 616 (F). The value of a number V represented in format 610 can thus be represented 
by 

$$V = (-1)^{S} \times 2^{E - \mathrm{bias}} \times (1.F).$$

Other floating point formats are possible for vector values 602 in other embodiments. 
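The relation between the fields S, E, and F and the value V can be checked with a short sketch. It assumes the standard IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits, bias 127) and covers normalized numbers only; the function names are hypothetical.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision


def decode_single(x):
    """Split a number, rounded to single precision, into its (S, E, F) fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    s = bits >> 31            # sign bit S
    e = (bits >> 23) & 0xFF   # biased exponent E
    f = bits & 0x7FFFFF       # 23-bit fraction F
    return s, e, f


def value_from_fields(s, e, f):
    """Evaluate V = (-1)^S * 2^(E - bias) * (1.F) for a normalized number."""
    return (-1) ** s * 2.0 ** (e - BIAS) * (1 + f / 2 ** 23)
```

For example, -6.5 equals -1.625 times 4, so it decodes to S = 1, E = 129, and a fraction field representing 0.625.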

Turning now to Fig. 2B, pseudocode illustrating operation of PFMIN instruction 100 is given. As 
shown, upon execution of PFMIN instruction 100, a comparison of a first vector portion (such as value 602A) of 
the value specified by first operand field 102A and a first vector portion of the value specified by second operand field 102B is 
performed. Concurrently, a comparison of a second vector portion (such as value 602B) of the value specified by 
first operand field 102A and a second vector portion of the value specified by second operand field 102B is also 
performed. 

If the first vector portion of the value specified by first operand field 102A is found to be less than the 
first vector portion of the value specified by second operand field 102B, the value of the first vector portion of the 
value specified by first operand field 102A is conveyed as a first vector portion of a result of instruction 100. Otherwise, 
the value of the first vector portion of the value specified by second operand field 102B is conveyed as the first 
vector portion of the result of instruction 100. Similarly, if the second vector portion of the value specified by 
first operand field 102A is found to be less than the second vector portion of the value specified by second 
operand field 102B, the value of the second vector portion of the value specified by first operand field 102A is 
conveyed as a second vector portion of the result of instruction 100. Otherwise, the value of the second vector portion of 
the value specified by second operand field 102B is conveyed as the second vector portion of the result of instruction 
100. This sequence of operations effectuates execution of the minimum value function. Fig. 2C is a table which 
shows the output of instruction 100 given various inputs, including cases in which operands 102 are zero or in 
unsupported formats. 
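The element-wise selection described for PFMIN can be sketched as follows. Each operand is modeled as a pair of Python floats standing in for vector values 602A and 602B; the special cases of Fig. 2C (zero or unsupported formats) are not modeled, and the function name is chosen here for illustration.

```python
def pfmin(op1, op2):
    """Return, per vector portion, op1's value if it is less than op2's, else op2's."""
    return tuple(a if a < b else b for a, b in zip(op1, op2))
```

The two portions are independent: with first operand (1.0, 5.0) and second operand (2.0, -3.0), the first portion of the result comes from the first operand and the second portion from the second operand.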

The result (both the first and second vector portions) of instruction 100 is subsequently written to 
register mmreg1 within floating point execution unit 36F. In another embodiment of instruction 100, the result 
value may be stored to mmreg2, a memory location, or a third register specified by an additional operand. It is 
noted that in other embodiments of instruction 100, the operands are not vectored and thus include only a single 
value. It is further noted that in still other embodiments of operands 102, these values may include additional 
vector values beyond the two vector values shown in Fig. 7A. 

Turning now to Fig. 3A, the format of a floating point maximum value instruction ("PFMAX") 200 is 
shown according to one embodiment of the invention. The format of PFMAX instruction 200 is similar to that 
described above for PFMIN instruction 100. As depicted, PFMAX instruction 200 includes an opcode value 201 
and two operands, first operand field 202A and second operand field 202B. The value specified by first operand 
field 202A is shown as being "mmreg1", which, in one embodiment, is one of the registers on the stack of 
floating point execution unit 36F. The value specified by second operand field 202B is shown as either being 
another of the floating point stack registers or a memory location. In another embodiment, second operand field 
202B specifies an immediate value. 

Turning now to Fig. 3B, pseudocode illustrating operation of PFMAX instruction 200 is given. As 
shown, upon execution of PFMAX instruction 200, a comparison of a first vector portion (such as value 602A) of 
the value specified by first operand field 202A and a first vector portion of the value specified by second operand field 202B is 
performed. Concurrently, a comparison of a second vector portion (such as value 602B) of the value specified by 
first operand field 202A and a second vector portion of the value specified by second operand field 202B is also 
performed. 

If the first vector portion of the value specified by first operand field 202A is found to be greater than the 
first vector portion of the value specified by second operand field 202B, the value of the first vector portion of the 
value specified by first operand field 202A is conveyed as a first vector portion of a result of instruction 200. Otherwise, 
the value of the first vector portion of the value specified by second operand field 202B is conveyed as the first 
vector portion of the result of instruction 200. Similarly, if the second vector portion of the value specified by 
first operand field 202A is found to be greater than the second vector portion of the value specified by second 
operand field 202B, the value of the second vector portion of the value specified by first operand field 202A is 
conveyed as a second vector portion of the result of instruction 200. Otherwise, the value of the second vector portion of 
the value specified by second operand field 202B is conveyed as the second vector portion of the result of instruction 
200. This sequence of operations effectuates execution of the maximum value function. Fig. 3C is a table which 
shows the output of instruction 200 given various inputs, including cases in which operands 202 are zero or in 
unsupported formats. 

The result (both the first and second vector portions) of instruction 200 is subsequently written to 
register mmreg1 within floating point execution unit 36F. In another embodiment of instruction 200, the result 
value may be stored to mmreg2, a memory location, or a third register specified by an additional operand. It is 
noted that in other embodiments of instruction 200, the operands are not vectored and thus include only a single 
value. It is further noted that in still other embodiments of operands 202, these values may include additional 
vector values beyond the two vector values shown in Fig. 7A. 

Turning now to Fig. 4A, the format of a floating point equality compare instruction ("PFCMPEQ") 300 
is shown according to one embodiment of the invention. The format of PFCMPEQ instruction 300 is similar to 
that described above for instructions 100 and 200. As depicted, PFCMPEQ instruction 300 includes an opcode 
value 301 and two operands, first operand field 302A and second operand field 302B. The value specified by first 
operand field 302A is shown as being "mmreg1", which, in one embodiment, is one of the registers on the stack 
of floating point execution unit 36F. The value specified by second operand field 302B is shown as either being 
another of the floating point stack registers or a memory location. In another embodiment, second operand field 
302B specifies an immediate value. 

Turning now to Fig. 4B, pseudocode illustrating operation of PFCMPEQ instruction 300 is given. As 
shown, upon execution of PFCMPEQ instruction 300, a comparison of a first vector portion (such as value 602A) 
of the value specified by first operand field 302A and a first vector portion of the value specified by second operand field 302B is 
performed. Concurrently, a comparison of a second vector portion (such as value 602B) of the value specified by 
first operand field 302A and a second vector portion of the value specified by second operand field 302B is also 
performed. 

If the first vector portion of the value specified by first operand field 302A is found to be equal to the 
first vector portion of the value specified by second operand field 302B, one of two constant values (described 
below with reference to Fig. 8) is conveyed as a first vector portion of a result of instruction 300. Otherwise, 
the other constant value is conveyed as the first vector portion of the result of instruction 300. Similarly, if the 
second vector portion of the value specified by first operand field 302A is found to be equal to the second vector 
portion of the value specified by second operand field 302B, the one constant value is conveyed as a second vector 
portion of the result of instruction 300. Otherwise, the other constant value is conveyed as the second vector 
portion of the result of instruction 300. This sequence of operations effectuates execution of the equality compare 
function. Fig. 4C is a table which shows the output of instruction 300 given various inputs, including cases in 
which operands 302 are zero or in unsupported formats. 

The result (both the first and second vector portions) of instruction 300 is subsequently written to 
register mmreg1 within floating point execution unit 36F. In another embodiment of instruction 300, the result 
value may be stored to mmreg2, a memory location, or a third register specified by an additional operand. It is 
noted that in other embodiments of instruction 300, the operands are not vectored and thus include only a single 
value. It is further noted that in still other embodiments of operands 302, these values may include additional 
vector values beyond the two vector values shown in Fig. 7A. 

Turning now to Fig. 5A, the format of a floating point greater than compare instruction ("PFCMPGT") 
400 is shown according to one embodiment of the invention. The format of PFCMPGT instruction 400 is similar 
to that described above for instructions 100, 200, and 300. As depicted, PFCMPGT instruction 400 includes an 
opcode value 401 and two operands, first operand field 402A and second operand field 402B. The value specified by 
first operand field 402A is shown as being "mmreg1", which, in one embodiment, is one of the registers on the 
stack of floating point execution unit 36F. The value specified by second operand field 402B is shown as either 
being another of the floating point stack registers or a memory location. In another embodiment, second operand 
field 402B specifies an immediate value. 

Turning now to Fig. 5B, pseudocode illustrating operation of PFCMPGT instruction 400 is given. As 
shown, upon execution of PFCMPGT instruction 400, a comparison of a first vector portion (such as value 602A) 
of the value specified by first operand field 402A and a first vector portion of the value specified by second operand field 402B is 
performed. Concurrently, a comparison of a second vector portion (such as value 602B) of the value specified by 
first operand field 402A and a second vector portion of the value specified by second operand field 402B is also 
performed. 

If the first vector portion of the value specified by first operand field 402A is found to be greater than the 
first vector portion of the value specified by second operand field 402B, one of two constant values (described 
below with reference to Fig. 8) is conveyed as a first vector portion of a result of instruction 400. Otherwise, 
the other constant value is conveyed as the first vector portion of the result of instruction 400. Similarly, if the 
second vector portion of the value specified by first operand field 402A is found to be greater than the second 
vector portion of the value specified by second operand field 402B, the one constant value is conveyed as a second 
vector portion of the result of instruction 400. Otherwise, the other constant value is conveyed as the second 
vector portion of the result of instruction 400. This sequence of operations effectuates execution of the greater 
than compare function. Fig. 5C is a table which shows the output of instruction 400 given various inputs, 
including cases in which operands 402 are zero or in unsupported formats. 

The result (both the first and second vector portions) of instruction 400 is subsequently written to 
register mmreg1 within floating point execution unit 36F. In another embodiment of instruction 400, the result 
value may be stored to mmreg2, a memory location, or a third register specified by an additional operand. It is 
noted that in other embodiments of instruction 400, the operands are not vectored and thus include only a single 
value. It is further noted that in still other embodiments of operands 402, these values may include additional 
vector values beyond the two vector values shown in Fig. 7A. 

Turning now to Fig. 6A, the format of a floating point greater than or equal to compare instruction 
("PFCMPGE") 500 is shown according to one embodiment of the invention. The format of PFCMPGE 
instruction 500 is similar to that described above for instructions 100, 200, 300, and 400. As depicted, 
PFCMPGE instruction 500 includes an opcode value 501 and two operand fields, first operand field 502A and 
second operand field 502B. The value specified by first operand field 502A is shown as being "mmreg1", which, in 
one embodiment, is one of the registers on the stack of floating point execution unit 36F. The value specified by 
second operand field 502B is shown as either being another of the floating point stack registers or a memory 
location. In another embodiment, second operand field 502B specifies an immediate value. 

Turning now to Fig. 6B, pseudocode illustrating operation of PFCMPGE instruction 500 is given. As 
shown, upon execution of PFCMPGE instruction 500, a comparison of a first vector portion (such as value 602A) 
of the value specified by first operand field 502A and a first vector portion of the value specified by second operand field 502B is 
performed. Concurrently, a comparison of a second vector portion (such as value 602B) of the value specified by 
first operand field 502A and a second vector portion of the value specified by second operand field 502B is also 
performed. 

If the first vector portion of the value specified by first operand field 502A is found to be greater than or 
equal to the first vector portion of the value specified by second operand field 502B, one of two constant values 
(described below with reference to Fig. 8) is conveyed as a first vector portion of a result of instruction 
500. Otherwise, the other constant value is conveyed as the first vector portion of the result of instruction 500. 
Similarly, if the second vector portion of the value specified by first operand field 502A is found to be greater 
than or equal to the second vector portion of the value specified by second operand field 502B, the one constant 
value is conveyed as a second vector portion of the result of instruction 500. Otherwise, the other constant value 
is conveyed as the second vector portion of the result of instruction 500. This sequence of operations effectuates 
execution of the greater than or equal to compare function. Fig. 6C is a table which shows the output of 
instruction 500 given various inputs, including cases in which operands 502 are zero or in unsupported formats. 

The result (both the first and second vector portions) of instruction 500 is subsequently written to 
register mmreg1 within floating point execution unit 36F. In another embodiment of instruction 500, the result 
value may be stored to mmreg2, a memory location, or a third register specified by an additional operand. It is 
noted that in other embodiments of instruction 500, the operands are not vectored and thus include only a single 
value. It is further noted that in still other embodiments of operands 502, these values may include additional 
vector values beyond the two vector values shown in Fig. 7A. 

It is noted that various other extreme value and compare instructions may be implemented in other 
embodiments. 

Turning now to Fig. 8, a block diagram of multimedia execution unit 36D is shown according to one 
embodiment of the invention. As depicted, execution unit 36D includes input registers 702A and 702B. Register 
702A is coupled to floating point unit 36F by a first operand bus 710A. Likewise, register 702B is coupled to 
floating point unit 36F by a second operand bus 710B. Second operand bus 710B also couples input register 
702B to receive operands via decode unit 20 and memory in one embodiment. Execution unit 36D further 
includes a comparator unit 730, which is coupled to receive decoded opcode values on decoded opcode bus 720 
from decode unit 20. Comparator unit 730 is further coupled to receive compare inputs (labeled "a" and "b") from 
register output buses 704A-B, coupled to the outputs of registers 702A and 702B, respectively. Comparator unit 
730 conveys a select bus 732A as output. 

Select bus 732A is coupled to output multiplexer 740A in order to select one of the inputs to multiplexer 
740A as an output on result bus 742. The inputs to multiplexer 740A include a first constant value 734A, a 
second constant value 734B, and register output buses 704A and 704B. The output of multiplexer 740A is stored 
in an output register 750, and subsequently forwarded to a result destination, such as a floating point register 
within execution unit 36F. 

Decoded versions of opcodes recognized by microprocessor 10 are conveyed from decode unit 20 to 
comparator unit 730 on decoded opcode bus 720. If the value on bus 720 is an opcode that corresponds to an 
extreme value function or a compare function (i.e., opcodes such as opcode 101, opcode 201, etc.), comparator 
unit 730 is configured to assert signals on select bus 732A during the current clock cycle. These signals are used to 
select the output of multiplexer 740A as described below. If the value on bus 720 does not correspond to an 
extreme value function or compare function, comparator unit 730 is inactive for the current clock cycle. 

Concurrently with decoded opcode values being conveyed to execution unit 36D, first and second 
operands are conveyed to unit 36D on buses 710A and 710B, respectively. As depicted, the first operand is 
conveyed to an input register 702A on bus 710A. The first operand is selected according to a value of the first 
operand field as shown above. For example, the value of the first operand field as stored in instruction cache 14 
may specify a particular register location. Subsequent logic in decode unit 20 and instruction control logic 34 is 
responsible for forwarding the value stored in the particular register location on bus 710A. The second operand 
of the instruction is conveyed similarly on bus 710B and stored in register 702B. In the embodiment shown in 
Fig. 8, the operands stored in registers 702 are non-vectored; that is, they each contain only a single independent 
value. An execution unit which processes vectored operands is described below with reference to Fig. 9. 

In the same clock cycle that the decoded opcode value is provided to comparator unit 730 on decoded 
opcode bus 720, the first and second operands are also conveyed as compare inputs to unit 730 on buses 704A-B. 
Comparator unit 730 performs a comparison of the two operands, and, according to the decoded opcode value on bus 
720, conveys values on bus 732A which select one of the inputs to multiplexer 740A. The inputs to multiplexer 
740A include first constant value 734A, second constant value 734B, and the values of the first and second 
operands on buses 704A-B. These inputs are labeled as "I1"-"I4" in Fig. 8. 

If the decoded opcode value indicates a compare operation, the values on select bus 732A select between 
constants 734 according to the result of the compare operation. These constants may be any values that are useful to 
have written to a register as the result of a compare. On the other hand, if the decoded opcode value indicates an 
extreme value function (minimum or maximum value), the values on select bus 732A select between the operand 
values on buses 704 according to the results of the compare. The particular inputs selected from multiplexer 
740A are shown in Table 1 for the various compare results and the instructions described above. 

Table 1 

The output of multiplexer 740A is conveyed on result bus 742 to an output register 750 as the result of 
the instruction corresponding to the decoded opcode value received on bus 720. This result may be forwarded to 
a variety of destinations. As described above with reference to instructions 100, 200, 300, 400, and 500, the 
result of the instruction is written to the register location in floating point unit 36F which stores the first operand. 
In other embodiments, the result may be written to a different register specified by an additional operand value. 
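The division of labor between comparator unit 730 and multiplexer 740A for non-vectored operands can be modeled with a short sketch. Several details are assumptions made for illustration: the ordering of the four multiplexer inputs, the use of all ones and zero as the two constants (the text only requires values useful as a compare result), and the function names. The contents of Table 1 are not reproduced.

```python
import operator

# Predicate evaluated by the comparator for each opcode.
PREDICATES = {
    "PFMIN": operator.lt,
    "PFMAX": operator.gt,
    "PFCMPEQ": operator.eq,
    "PFCMPGT": operator.gt,
    "PFCMPGE": operator.ge,
}
COMPARE_OPCODES = {"PFCMPEQ", "PFCMPGT", "PFCMPGE"}


def comparator(opcode, a, b):
    """Return a multiplexer select index, or None when the comparator is inactive."""
    if opcode not in PREDICATES:
        return None
    hit = PREDICATES[opcode](a, b)
    if opcode in COMPARE_OPCODES:
        return 0 if hit else 1   # choose between the two constants
    return 2 if hit else 3       # choose between the two operands


def multiplexer(select, const_a, const_b, a, b):
    """Pass the selected input through (input ordering is hypothetical)."""
    return (const_a, const_b, a, b)[select]


def execute(opcode, a, b, const_a=0xFFFFFFFF, const_b=0x00000000):
    select = comparator(opcode, a, b)
    if select is None:
        return None
    return multiplexer(select, const_a, const_b, a, b)
```

The same comparison result drives both instruction classes; only the pair of multiplexer inputs being chosen from differs, which is why the extreme value instructions add little hardware beyond what compares already need.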

Turning now to Fig. 9, a multimedia execution unit 37D is shown. Execution unit 37D is similar in 
structure to execution unit 36D; therefore, logic blocks and buses in execution unit 37D which operate similarly 
to those in Fig. 8 are numbered identically for convenience and clarity. Unlike execution unit 36D, however, unit 
37D is configured to efficiently process vector operands. Execution unit 37D is interchangeable with execution 
unit 36D within the context of microprocessor 10 shown in Fig. 1. 

As depicted, execution unit 37D includes input registers 701A and 701B. Register 701A is coupled to 
floating point unit 36F by a first operand bus 710A. Likewise, register 701B is coupled to floating point unit 36F 
by a second operand bus 710B. Second operand bus 710B also couples input register 701B to receive operands 
via decode unit 20 and memory in one embodiment. Unlike input registers 702 shown in Fig. 8, input registers 
701 are vectored. As shown, register 701A includes a first vector portion 703A and a second vector portion 
703B. Likewise, register 701B includes a first vector portion 703C and a second vector portion 703D. Execution 
unit 37D further includes a comparator unit 731, which is similar to comparator unit 730 shown in Fig. 8 in that 
comparator unit 731 also receives a decoded opcode value on bus 720. Comparator unit 731, however, is 
configured to perform two comparisons and convey two sets of select bus signals concurrently. Unit 731 receives 
a first set of signals (at inputs labeled "a1" and "b1") which correspond to first vector portions 703A and 703C, as 
well as a second set of signals (at inputs labeled "a2" and "b2") which correspond to second vector portions 703B 
and 703D. In response to the comparison performed for inputs a1 and b1, comparator unit 731 conveys select bus 
732A as output to multiplexer 740A. Similarly, comparator 731 conveys values on a select bus 732B to 
multiplexer 740B in response to the comparison performed for the inputs labeled a2 and b2. 

Select bus 732A is coupled to output multiplexer 740A in order to select one of the inputs to multiplexer 
740A as an output on result bus 743A. The inputs to multiplexer 740A include first constant value 734A, second 
constant value 734B, and first vector portions 703A and 703C. The output of multiplexer 740A is conveyed on 
result bus 743A to be stored in a first vector portion 752A of an output register 751. 

Similarly, select bus 732B is coupled to an output multiplexer 740B in order to select a multiplexer 740B 
input for output. Inputs to multiplexer 740B include constant values 734, and second vector portions 703B and 
703D. The selected output of multiplexer 740B is conveyed on a result bus 743B to be stored in a second vector 
portion 752B of output register 751. 

The first and second vector portions of register 751 are subsequently forwarded to a result destination, 
such as a floating point register within execution unit 36F. 
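The vectored data path can be sketched by applying one selection rule to each vector portion independently. As before, this is an illustrative model under assumptions: the two constants are taken to be all ones and zero, operands are pairs of Python floats, and the two lanes are evaluated one after the other in software even though the hardware described above performs them concurrently.

```python
import operator

# Hypothetical stand-ins for constant values 734; the source does not fix them.
CONSTANTS = {True: 0xFFFFFFFF, False: 0x00000000}

# opcode -> (predicate, kind); kind names the pair of inputs the multiplexer picks from.
OPCODES = {
    "PFMIN": (operator.lt, "extreme"),
    "PFMAX": (operator.gt, "extreme"),
    "PFCMPEQ": (operator.eq, "compare"),
    "PFCMPGT": (operator.gt, "compare"),
    "PFCMPGE": (operator.ge, "compare"),
}


def lane(opcode, a, b):
    """One comparison plus one multiplexer selection, as for a single vector portion."""
    predicate, kind = OPCODES[opcode]
    hit = predicate(a, b)
    if kind == "compare":
        return CONSTANTS[hit]
    return a if hit else b


def execute_vector(opcode, reg_a, reg_b):
    """Model registers 701A and 701B as (first portion, second portion) pairs."""
    (a1, a2), (b1, b2) = reg_a, reg_b
    # First lane feeds portion 752A, second lane feeds portion 752B.
    return (lane(opcode, a1, b1), lane(opcode, a2, b2))
```

Because neither lane reads the other lane's inputs or result, the two selections have no dependency on each other, which is what permits them to proceed in the same clock cycle.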

Extreme value instructions 100 and 200 are efficiently implemented in microprocessor 10. By 
modifying comparator and multiplexer hardware used for compare operations, minimum and maximum value 
instructions 100 and 200 may be added to the instruction set of microprocessor 10 with minimal overhead. The 
performance of microprocessor 10 in applications which commonly use these extreme value functions is thus 
advantageously increased. 
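As one illustration of how such functions are used, a value can be clamped to a range with one maximum and one minimum operation. The text does not name specific applications, so clamping is an assumed example, and the helper names below are hypothetical models of the instructions rather than their definitions.

```python
def pfmin(op1, op2):
    # Per vector portion: the lesser of the two values.
    return tuple(a if a < b else b for a, b in zip(op1, op2))


def pfmax(op1, op2):
    # Per vector portion: the greater of the two values.
    return tuple(a if a > b else b for a, b in zip(op1, op2))


def clamp(values, low, high):
    """Limit each vector portion of values to the range [low, high]."""
    return pfmin(pfmax(values, low), high)
```

Without extreme value instructions, each portion would need a compare followed by a conditional move or a branch, so the two-instruction form avoids data-dependent control flow.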

Numerous variations and modifications will become apparent to those skilled in the art once the above 
disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such 
variations and modifications. 

What is Claimed: 

1. A microprocessor configured to execute a first instruction, wherein an encoded representation 
of said first instruction includes an opcode field, a first operand field, and a second operand field, said 
microprocessor comprising: 

an execution unit coupled to receive a decoded value of said opcode field, a first operand 
specified by a value of said first operand field, and a second operand specified by a value of said second operand 
field, wherein said execution unit is configured to perform an extreme value operation on said first operand and 
said second operand in response to receiving said decoded value of said opcode field; 

and wherein said execution unit is further configured to convey an output value of said extreme 
value operation as a result of said first instruction. 

2. The microprocessor of claim 1, wherein said extreme value operation is a minimum value 
operation, and wherein said execution unit is configured to determine which of said first operand and said second 
operand has a lesser value in response to receiving a decoded value of said opcode field indicative of said 
minimum value operation, and wherein said execution unit is configured to convey said lesser value as said result 
of said first instruction. 

3. The microprocessor of claim 1, wherein said extreme value operation is a maximum value 
operation, and wherein said execution unit is configured to determine which of said first operand and said second 
operand has a greater value in response to receiving a decoded value of said opcode field indicative of said 
maximum value operation, and wherein said execution unit is configured to convey said greater value as said 
result of said first instruction. 

4. The microprocessor of claim 1, further comprising a register file which includes a first register, 
wherein said first register stores said first operand, and wherein said value of said first operand field specifies said 
first register, and wherein said register file is configured to convey said first operand to said execution unit in 
response to receiving said value of said first operand field. 

5. The microprocessor of claim 4, wherein said execution unit is further configured to convey said 
result of said first instruction to said register file, and wherein said register file is configured to store said result as 
a new value of said first register. 

6. The microprocessor of claim 5, wherein said register file further includes a second register 
which stores said second operand, and wherein said value of said second operand field specifies said second 
register, and wherein said register file is configured to convey said second operand to said execution unit in 
response to receiving said value of said second operand field. 

7. The microprocessor of claim 5, further comprising a memory unit which includes a first 
memory location, wherein said first memory location stores said second operand, and wherein said value of said 
second operand specifies said first memory location, and wherein said memory unit is configured to convey said 
second operand to said execution unit in response to receiving said value of said second operand field. 

8. The microprocessor of claim 5, wherein said second operand is an immediate value. 

9. The microprocessor of claim 8, wherein said immediate value is specified by said value of said 
second operand field. 

10. The microprocessor of claim 1, wherein said first operand and second operand are floating point
numbers.

11. The microprocessor of claim 1, wherein said first operand includes a first vector value followed by a
second vector value, and wherein said second operand includes a third vector value followed by a fourth vector
value, and wherein said execution unit is configured to perform said extreme value operation on said first vector
value and said third vector value, thereby generating a first vector portion of said output value conveyed as a
result of said first instruction.

12. The microprocessor of claim 9, wherein said execution unit is configured to perform said
extreme value operation on said second vector value and said fourth vector value, thereby generating a second
vector portion of said output value conveyed as a result of said first instruction, wherein generation of said first
vector portion is performed concurrently with generation of said second vector portion.

13. An execution unit in a microprocessor for executing a first instruction, wherein an encoded
representation of said first instruction includes an opcode field, a first operand field, and a second operand field,
said execution unit comprising:

a first input register coupled to receive a first operand specified by a value of said first operand
field;

a second input register coupled to receive a second operand specified by a value of said second
operand field;

a comparator unit coupled to receive said first operand from said first input register and said
second operand from said second input register, and wherein said comparator unit is coupled to receive a decoded
opcode value of said opcode field on a decoded opcode bus;

a multiplexer coupled to receive a plurality of inputs including said first operand from said first
input register, said second operand from said second input register, a first constant value, and a second constant
value, and wherein said multiplexer is configured to select one of said plurality of inputs to be conveyed as a
result of said first instruction in response to receiving one or more control signals from said comparator unit;

and wherein said comparator unit is configured to generate said one or more control signals in
response to receiving said decoded opcode value, and wherein, if said decoded opcode value indicates that said 
first instruction is one of a plurality of extreme value instructions, said one or more control signals are usable to 
select either said first operand or said second operand as said output value, and wherein, if said decoded opcode 
value indicates that said first instruction is one of a plurality of compare instructions, said one or more control 
signals are usable to select either said first constant value or said second constant value as said output value. 
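
By way of illustration only, the comparator unit and multiplexer recited in claim 13 may be modeled in software as follows. This sketch is not part of the claims. The names `Op`, `comparator`, `multiplexer`, and `execute`, the encoding of the control signals as an index from 0 to 3, and the 32-bit width of the two constants are assumptions made for the example.

```python
from enum import Enum

ALL_ONES = 0xFFFFFFFF   # stands in for the "first constant value" (assumed 32 bits)
ALL_ZEROS = 0x00000000  # stands in for the "second constant value" (assumed 32 bits)


class Op(Enum):
    """Hypothetical decoded opcode values."""
    MIN = "min"
    MAX = "max"
    CMP_GE = "ge"
    CMP_GT = "gt"
    CMP_EQ = "eq"


# Hypothetical control-signal encoding: an index into the multiplexer inputs.
SEL_A, SEL_B, SEL_ONES, SEL_ZEROS = range(4)


def comparator(op, a, b):
    """Compare the operands and return the multiplexer select value."""
    if op is Op.MIN:
        return SEL_A if a < b else SEL_B
    if op is Op.MAX:
        return SEL_A if a > b else SEL_B
    if op is Op.CMP_GE:
        return SEL_ONES if a >= b else SEL_ZEROS
    if op is Op.CMP_GT:
        return SEL_ONES if a > b else SEL_ZEROS
    if op is Op.CMP_EQ:
        return SEL_ONES if a == b else SEL_ZEROS
    raise ValueError(f"not an extreme value or compare opcode: {op!r}")


def multiplexer(select, a, b):
    """Convey one of the four inputs as the result of the instruction."""
    return (a, b, ALL_ONES, ALL_ZEROS)[select]


def execute(op, a, b):
    """Run one instruction through the comparator and then the multiplexer."""
    return multiplexer(comparator(op, a, b), a, b)
```

In this model an extreme value opcode can only ever select one of the two operands, and a compare opcode can only ever select one of the two constants, which mirrors the final wherein clause of claim 13.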

14. The execution unit of claim 13, wherein said plurality of extreme value instructions includes a 
minimum value instruction and a maximum value instruction. 

15. The execution unit of claim 14, wherein said first instruction is said minimum value instruction, 
and wherein said comparator unit is configured to determine which of said first operand and said second operand 
has a lesser value in response to receiving a decoded opcode value indicative of said minimum value instruction, 
and wherein said comparator unit is further configured to convey a first set of control signal values to said 
multiplexer which specify said lesser value, and wherein said multiplexer is configured to convey said lesser 
value as said result of said first instruction in response to receiving said first set of control signal values. 

16. The execution unit of claim 14, wherein said first instruction is said maximum value instruction, 
and wherein said comparator unit is configured to determine which of said first operand and said second operand 
has a greater value in response to receiving a decoded opcode value indicative of said maximum value instruction, 
and wherein said comparator unit is further configured to convey a second set of control signal values to said 
multiplexer which specify said greater value, and wherein said multiplexer is configured to convey said greater 
value as said result of said first instruction in response to receiving said second set of control signal values. 

17. The execution unit of claim 14, wherein said plurality of compare instructions includes a greater 
than or equal to instruction, a greater than instruction, and an equality instruction. 

18. The execution unit of claim 17, wherein said first instruction is said greater than or equal to 
instruction, and wherein said comparator unit is configured to determine if said first operand is greater than or 
equal to said second operand in response to receiving a decoded opcode value indicative of said greater than or 
equal to instruction, and wherein said comparator unit is further configured to convey a third set of control signal 
values to said multiplexer which specify either said first constant value or said second constant value. 

19. The execution unit of claim 18, wherein said first operand is greater than or equal to said second 
operand, and wherein said third set of control signal values specify said first constant value, and wherein said first 
constant value is conveyed as said result of said first instruction in response to receiving said third set of control 
signal values.

20. The execution unit of claim 19, wherein said first constant value includes a plurality of bits,
wherein all of said plurality of bits are set, and wherein said first constant value is usable as a mask by an 
instruction executed subsequent to said first instruction. 

21. The execution unit of claim 18, wherein said first operand is less than said second operand, and
wherein said third set of control signal values specify said second constant value, and wherein said second 
constant value is conveyed as said result of said first instruction in response to receiving said third set of control 
signal values. 

22. The execution unit of claim 21, wherein said second constant value includes a plurality of bits,
wherein none of said plurality of bits are set, and wherein said second constant value is usable as a mask by an 
instruction executed subsequent to said first instruction. 
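
Claims 20 and 22 state that the all-bits-set and no-bits-set constants are usable as a mask by a subsequent instruction. One common use of such a mask, sketched below for illustration only, is a branch-free selection between two bit patterns. The 32-bit width and the helper name `select_with_mask` are assumptions made for the example and do not appear in the claims.

```python
MASK32 = 0xFFFFFFFF  # assumed operand width of 32 bits


def select_with_mask(mask, x, y):
    """Return the bits of x where mask is set and the bits of y where it is clear."""
    return (mask & x) | (~mask & MASK32 & y)
```

With a mask of `0xFFFFFFFF` the helper returns `x` unchanged, and with a mask of `0x00000000` it returns `y` unchanged, so a compare result can steer later logical instructions without a conditional branch.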

23. The execution unit of claim 17, wherein said first instruction is said greater than instruction, and 
wherein said comparator unit is configured to determine if said first operand is greater than said second operand 
in response to receiving a decoded opcode value indicative of said greater than instruction, and wherein said 
comparator unit is further configured to convey a fourth set of control signal values to said multiplexer which 
specify either said first constant value or said second constant value. 

24. The execution unit of claim 23, wherein said first operand is greater than said second operand, 
and wherein said fourth set of control signal values specify said first constant value, and wherein said first 
constant value is conveyed as said result of said first instruction in response to receiving said fourth set of control 
signal values. 

25. The execution unit of claim 24, wherein said first constant value includes a plurality of bits, 
wherein all of said plurality of bits are set, and wherein said first constant value is usable as a mask by an 
instruction executed subsequent to said first instruction. 

26. The execution unit of claim 23, wherein said first operand is less than said second operand, and 
wherein said fourth set of control signal values specify said second constant value, and wherein said second 
constant value is conveyed as said result of said first instruction in response to receiving said fourth set of control 
signal values. 

27. The execution unit of claim 26, wherein said second constant value includes a plurality of bits, 
wherein none of said plurality of bits are set, and wherein said second constant value is usable as a mask by an 
instruction executed subsequent to said first instruction.

28. The execution unit of claim 17, wherein said first instruction is said equality instruction, and
wherein said comparator unit is configured to determine if said first operand is equal to said second operand in 
response to receiving a decoded opcode value indicative of said equality instruction, and wherein said comparator 
unit is further configured to convey a fifth set of control signal values to said multiplexer which specify either 
said first constant value or said second constant value. 

29. The execution unit of claim 28, wherein, if said first operand is equal to said second operand, 
said fifth set of control signal values specify said first constant value, causing said multiplexer to convey said first 
constant value as said result of said first instruction. 

30. The execution unit of claim 29, wherein said first constant value includes a plurality of bits, 
wherein all of said plurality of bits are set, and wherein said first constant value is usable as a mask by an 
instruction executed subsequent to said first instruction. 

31. The execution unit of claim 28, wherein, if said first operand is not equal to said second
operand, said fifth set of control signal values specify said second constant value, causing said multiplexer to 
convey said second constant value as said result of said first instruction. 

32. The execution unit of claim 31, wherein said second constant value includes a plurality of bits, 
wherein none of said plurality of bits are set, and wherein said second constant value is usable as a mask by an 
instruction executed subsequent to said first instruction. 

33. The execution unit of claim 13, wherein said first operand is conveyed from a first register 
within a register file, and wherein said first operand is conveyed from said register file in response to said register 
file receiving said value of said first operand field, and wherein said value of said first operand field specifies said 
first register within said register file. 

34. The execution unit of claim 33, wherein said execution unit is further configured to convey said 
result of said first instruction to said register file, and wherein said register file is configured to store said result as 
a new value of said first register. 

35. The execution unit of claim 34, wherein said second operand is conveyed from a second 
register within said register file, and wherein said second operand is conveyed from said register file in response 
to said register file receiving said value of said second operand field, and wherein said value of said second 
operand field specifies said second register within said register file. 

36. The execution unit of claim 34, wherein said second operand is conveyed from a first memory
location within a memory unit, and wherein said second operand is conveyed from said memory unit in response
to said memory unit receiving said value of said second operand field, and wherein said value of said second
operand field specifies said first memory location within said memory unit.

37. The execution unit of claim 34, wherein said second operand is an immediate value.

38. The execution unit of claim 37, wherein said immediate value is specified by said value of said
second operand field.

39. The execution unit of claim 13, wherein said first operand and second operand are floating
point numbers.



40. An execution unit in a microprocessor for executing a first instruction, wherein an encoded
representation of said first instruction includes an opcode field, a first operand field, and a second operand field,
said execution unit comprising:

a first input register coupled to receive a first operand specified by a value of said first operand
field, wherein said first operand includes a first vector value followed by a second vector value;

a second input register coupled to receive a second operand specified by a value of said second
operand field, wherein said second operand includes a third vector value followed by a fourth vector value;

a comparator unit coupled to receive said first operand from said first input register and said
second operand from said second input register, and wherein said comparator unit is coupled to receive a decoded
opcode value of said opcode field on a decoded opcode bus;

a first multiplexer coupled to receive a first plurality of inputs including said first vector value
from said first input register, said third vector value from said second input register, a first constant value, and a
second constant value, and wherein said first multiplexer is configured to select one of said first plurality of inputs
to be conveyed as a first portion of a vector result of said first instruction in response to receiving a first set of
control signal values from said comparator unit;

and wherein said comparator unit is configured to generate said first set of control signal values
in response to receiving said decoded opcode value, and wherein, if said decoded opcode value indicates that said
first instruction is one of a plurality of extreme value instructions, said first set of control signal values are usable
to select either said first vector value or said third vector value as said first portion of said vector result, and
wherein, if said decoded opcode value indicates that said first instruction is one of a plurality of compare
instructions, said first set of control signal values are usable to select either said first constant value or said second
constant value as said first portion of said vector result.

41. The execution unit of claim 40, further comprising a second multiplexer coupled to receive a
second plurality of inputs including said second vector value from said first input register, said fourth vector value
from said second input register, said first constant value, and said second constant value, and wherein said second
multiplexer is configured to select one of said second plurality of inputs to be conveyed as a second vector
portion of said result of said first instruction in response to receiving a second set of control signal values from
said comparator unit;



42. The execution unit of claim 41, wherein said comparator unit is configured to generate said 
second set of control signal values in response to receiving said decoded opcode value, and wherein, if said 
decoded opcode value indicates that said first instruction is one of said plurality of extreme value instructions, 
said second set of control signal values are usable by said second multiplexer to select either said second vector 
value or said fourth vector value as said second portion of said vector result, and wherein, if said decoded opcode 
value indicates that said first instruction is one of a plurality of compare instructions, said first set of control 
signal values are usable by said second multiplexer to select either said first constant value or said second 
constant value as said second portion of said vector result. 
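
For illustration only, the two-multiplexer arrangement of claims 40 to 42 may be modeled as two independent lane selections over a packed operand. The sketch assumes a 64-bit operand holding two IEEE 754 single-precision values, with the first vector value in bits 31:0 and the second in bits 63:32, as drawn in FIGS. 2B to 6B. The helper names (`float_bits`, `bits_float`, `lane`, `execute_packed`, `pack`) and the string opcodes are hypothetical. Software evaluates the lanes one after the other, whereas the claimed hardware generates both portions concurrently.

```python
import struct

ALL_ONES, ALL_ZEROS = 0xFFFFFFFF, 0x00000000


def float_bits(x):
    """Raw 32-bit pattern of a single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(u):
    """Single-precision value encoded by a raw 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", u))[0]


def lane(op, a_bits, b_bits):
    """One comparator decision and one multiplexer selection for a single lane."""
    a, b = bits_float(a_bits), bits_float(b_bits)
    if op == "min":
        return a_bits if a < b else b_bits
    if op == "max":
        return a_bits if a > b else b_bits
    taken = {"ge": a >= b, "gt": a > b, "eq": a == b}[op]
    return ALL_ONES if taken else ALL_ZEROS


def execute_packed(op, src1, src2):
    """Apply the same opcode to the low and high lanes of two 64-bit operands."""
    lo = lane(op, src1 & 0xFFFFFFFF, src2 & 0xFFFFFFFF)  # first multiplexer
    hi = lane(op, src1 >> 32, src2 >> 32)                # second multiplexer
    return (hi << 32) | lo


def pack(lo, hi):
    """Build a 64-bit operand from two single-precision values."""
    return (float_bits(hi) << 32) | float_bits(lo)
```

For example, `execute_packed("min", pack(1.0, -3.0), pack(2.0, -4.0))` yields the same bits as `pack(1.0, -4.0)`, because each lane is decided independently of the other.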

43. The method of claim 42, wherein said plurality of extreme value instructions includes a 
minimum value instruction and a maximum value instruction. 

44. The execution unit of claim 44, wherein said plurality of compare instructions includes a greater 
than or equal to instruction, a greater than instruction, and an equality instruction. 

45. A method for executing a first instruction within an execution unit of a microprocessor, wherein 
an encoded representation of said first instruction includes an opcode field, a first operand field, and a second 
operand field, said method comprising: 

conveying a first plurality of inputs to a comparator unit within said execution unit, wherein 
said first plurality of inputs includes a first operand specified by a value of said first operand field, a second 
operand specified by a value of said second operand field, and a decoded opcode value which corresponds to an 
encoded opcode value of said opcode field; 

conveying a second plurality of inputs to a multiplexer within said execution unit, wherein said 
second plurality of inputs includes said first operand, said second operand, a first constant value, and a second 
constant value; 

generating a set of control signal values from said comparator unit in response to receiving said 
first plurality of inputs; 

conveying one of said second plurality of inputs from said multiplexer as a result of said first 
instruction in response to receiving said set of control signal values; 

and wherein said result is selected from said first operand and said second operand according to 
said set of control signal values if said decoded opcode value indicates that said first instruction corresponds to 
one of a plurality of extreme value instructions; 

and wherein result is selected from said first constant value and said second constant value 
according to said set of control signal values if said decoded opcode value indicates that said first instruction 
corresponds to one of a plurality of compare instructions.

46. The method of claim 45, wherein said plurality of extreme value instructions includes a
minimum value instruction and a maximum value instruction. 



47. The execution unit of claim 45, wherein said plurality of compare instructions includes a greater 
than or equal to instruction, a greater than instruction, and an equality instruction. 

48. The method of claim 45, further comprising conveying said first operand to said multiplexer 
and said comparator unit from a first register within a register file, and wherein said first operand is conveyed 
from said register file in response to said register file receiving said value of said first operand field, and wherein 
said value of said first operand field specifies said first register within said register file. 

49. The method of claim 48, further comprising conveying said result of said first instruction to said 
register file, wherein said register file is configured to store said result as a new value of said first register. 

50. The method of claim 49, wherein said first operand includes a first vector value followed by a 
second vector value, and wherein said second operand includes a third vector value followed by a fourth vector 
value. 

51. The method of claim 50, wherein said conveying said second plurality of inputs to said 
multiplexer includes conveying said first vector value and said third vector value. 

52. The method of claim 51, wherein said conveying said one of said second plurality of inputs
from said multiplexer includes conveying a first vector portion of said result of said instruction. 

53. The method of claim 52, further comprising conveying a third set of inputs to a second 
multiplexer within said execution unit, wherein said third set of inputs includes said second vector value, said 
fourth vector value, said first constant value, and said second constant value. 

54. The method of claim 53, further comprising generating a second set of control signals from said 
comparator unit in response to receiving said first plurality of inputs. 

55. The method of claim 54, further comprising conveying one of said third plurality of inputs from 
said second multiplexer as a second vector portion of said result of said first instruction. 

56. The method of claim 55, wherein said second vector portion of said result is selected from said 
third vector value and said fourth vector value according to said second set of control signal values if said 
decoded opcode value indicates that said first instruction corresponds to one of said plurality of extreme value 
instructions.

57. The method of claim 56, wherein second vector portion of said result is selected from said first
constant value and said second constant value according to said second set of control signal values if said decoded 
opcode value indicates that said first instruction corresponds to one of said plurality of compare instructions. 

58. A microprocessor configured to execute a first instruction, comprising: 

an instruction cache configured to store an encoded representation of said first instruction, 
wherein said encoded representation includes an opcode field, a first operand field, and a second operand field; 

a decode unit coupled to receive said encoded representation of said first instruction from said 
instruction cache, and wherein said decode unit is configured to generate a decoded opcode value in response to 
receiving a value of said opcode field; 

an execution unit coupled to said decode unit, wherein said decode unit is further configured to 
cause a first operand and a second operand to be conveyed to said execution unit, wherein said first operand is 
specified by a value of said first operand field, wherein said second operand is specified by a value of said second 
operand field, and wherein said execution unit includes: 

a first input register coupled to receive a first operand specified by a value of said first
operand field;

a second input register coupled to receive a second operand specified by a value of
said second operand field;

a comparator unit coupled to receive said first operand from said first input register 
and said second operand from said second input register, and wherein said comparator unit is coupled to receive a 
decoded opcode value of said opcode field on a decoded opcode bus; 

a multiplexer coupled to receive a plurality of inputs including said first operand from 
said first input register, said second operand from said second input register, a first constant value, and a second 
constant value, and wherein said multiplexer is configured to select one of said plurality of inputs to be conveyed 
as a result of said first instruction in response to receiving a control signal value from said comparator unit; 

and wherein said comparator unit is configured to generate said control signal value in response 
to receiving said decoded opcode value, and wherein, if said decoded opcode value indicates that said first 
instruction is one of a plurality of extreme value instructions, said control signal value is usable to select either 
said first operand or said second operand as said output value, and wherein, if said decoded opcode value 
indicates that said first instruction is one of a plurality of compare instructions, said control signal is usable to 
select either said first constant value or said second constant value as said output value.

PFMIN 100

mnemonic                          opcode/imm8      description
PFMIN mmreg1, mmreg2/mem64        0Fh 0Fh / 94h    Packed floating-point minimum
Reference numerals: 102A, 102B, 101

FIG. 2A 



IF (mmreg1[31:0] < mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = mmreg1[31:0]
    ELSE mmreg1[31:0] = mmreg2/mem64[31:0]
IF (mmreg1[63:32] < mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = mmreg1[63:32]
    ELSE mmreg1[63:32] = mmreg2/mem64[63:32]

FIG. 2B 



PFMIN
                                          Source 2
                           0                 Normal                  Unsupported
Source 1 &   0             +0                Source 2, +0*           Undefined
Destination  Normal        Source 1, +0**    Source 1/Source 2***    Undefined
             Unsupported   Undefined         Undefined               Undefined

Notes:

* The result is source 2 if source 2 is negative, otherwise the result is positive zero.

** The result is source 1 if source 1 is negative, otherwise the result is positive zero.

*** The result is source 1 if source 1 is negative and source 2 is positive. The result is
source 1 if both are negative and source 1 is greater in magnitude than source 2. The
result is source 1 if both are positive and source 1 is lesser in magnitude than source
2. The result is source 2 in all other cases.



FIG. 2C 
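
For illustration only, the rule stated in the third note (***) of FIG. 2C can be written directly in terms of sign and magnitude. The sketch assumes that each 32-bit value uses the IEEE 754 single-precision layout, with the sign in bit 31 and the magnitude in bits 30:0, so that for normal numbers a larger magnitude field means a larger magnitude. The function names `float_bits` and `pfmin_normal` are hypothetical, and the zero and unsupported rows of the table are not modeled.

```python
import struct


def float_bits(x):
    """Raw 32-bit pattern of a single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def pfmin_normal(s1, s2):
    """Apply the three 'source 1' cases of the note to two raw 32-bit patterns."""
    neg1, neg2 = s1 >> 31, s2 >> 31              # sign bits
    mag1, mag2 = s1 & 0x7FFFFFFF, s2 & 0x7FFFFFFF  # magnitude fields
    if neg1 and not neg2:
        return s1   # source 1 negative, source 2 positive
    if neg1 and neg2 and mag1 > mag2:
        return s1   # both negative, source 1 greater in magnitude
    if not neg1 and not neg2 and mag1 < mag2:
        return s1   # both positive, source 1 lesser in magnitude
    return s2       # all other cases
```

For normal operands this produces the same bits as an ordinary floating-point minimum, which is one way to check the note against the pseudocode of FIG. 2B.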




PFMAX 200

mnemonic                          opcode/imm8      description
PFMAX mmreg1, mmreg2/mem64        0Fh 0Fh / A4h    Packed floating-point maximum
Reference numerals: 202A, 202B, 201

FIG. 3A 



IF (mmreg1[31:0] > mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = mmreg1[31:0]
    ELSE mmreg1[31:0] = mmreg2/mem64[31:0]
IF (mmreg1[63:32] > mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = mmreg1[63:32]
    ELSE mmreg1[63:32] = mmreg2/mem64[63:32]

FIG. 3B 



PFMAX
                                          Source 2
                           0                 Normal                  Unsupported
Source 1 &   0             +0                Source 2, +0*           Undefined
Destination  Normal        Source 1, +0**    Source 1/Source 2***    Undefined
             Unsupported   Undefined         Undefined               Undefined

Notes:

* The result is source 2 if source 2 is positive, otherwise the result is positive zero.

** The result is source 1 if source 1 is positive, otherwise the result is positive zero.

*** The result is source 1 if source 1 is positive and source 2 is negative. The result is
source 1 if both are positive and source 1 is greater in magnitude than source 2.
The result is source 1 if both are negative and source 1 is lesser in magnitude than
source 2. The result is source 2 in all other cases.

PFCMPEQ 300

mnemonic                          opcode/imm8      description
PFCMPEQ mmreg1, mmreg2/mem64      0Fh 0Fh / B0h    Packed floating-point comparison, equal
Reference numerals: 302A, 302B, 301

FIG. 4A 



IF (mmreg1[31:0] = mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] = mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h



FIG. 4B 



PFCMPEQ
                                          Source 2
                           0                 Normal                       Unsupported
Source 1 &   0             FFFF_FFFFh*       0000_0000h                   0000_0000h
Destination  Normal        0000_0000h        0000_0000h, FFFF_FFFFh**     0000_0000h
             Unsupported   0000_0000h        0000_0000h                   Undefined

Notes:

* Positive zero is equal to negative zero.

** The result is FFFF_FFFFh if source 1 and source 2 have identical signs, exponents
and mantissas. It is 0000_0000h otherwise.

PFCMPGT 400

mnemonic                          opcode/imm8      description
PFCMPGT mmreg1, mmreg2/mem64      0Fh 0Fh / A0h    Packed floating-point comparison, greater
Reference numerals: 402A, 402B, 401

FIG. 5A



IF (mmreg1[31:0] > mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] > mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h



FIG. 5B 



PFCMPGT
                                          Source 2
                           0                            Normal                        Unsupported
Source 1 &   0             0000_0000h                   0000_0000h, FFFF_FFFFh*       Undefined
Destination  Normal        0000_0000h, FFFF_FFFFh**     0000_0000h, FFFF_FFFFh***     Undefined
             Unsupported   Undefined                    Undefined                     Undefined

Notes:

* The result is FFFF_FFFFh if source 2 is negative, otherwise the result is 0000_0000h.

** The result is FFFF_FFFFh if source 1 is positive, otherwise the result is 0000_0000h.

*** The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they are
both negative and source 1 is smaller in magnitude than source 2, or if source 1 and
source 2 are positive and source 1 is greater in magnitude than source 2. The result is
0000_0000h in all other cases.

PFCMPGE 500

mnemonic                          opcode/imm8      description
PFCMPGE mmreg1, mmreg2/mem64      0Fh 0Fh / 90h    Packed floating-point comparison, greater or equal
Reference numerals: 502A, 502B, 501

FIG. 6A 
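
For illustration only, the opcode/imm8 values listed in FIGS. 2A, 3A, 4A, 5A, and 6A can be collected into a small decode table that also records whether each instruction is an extreme value instruction or a compare instruction. The function name `decode`, the table name `SUFFIX_TABLE`, and the class labels are hypothetical. The figures do not show where the operand specifier bytes sit relative to the opcode bytes and the imm8 value, so the sketch takes the two opcode bytes and the imm8 value as separate arguments rather than parsing a full encoding.

```python
# imm8 value -> (mnemonic, instruction class), as listed in FIGS. 2A to 6A
SUFFIX_TABLE = {
    0x94: ("PFMIN", "extreme"),
    0xA4: ("PFMAX", "extreme"),
    0xB0: ("PFCMPEQ", "compare"),
    0xA0: ("PFCMPGT", "compare"),
    0x90: ("PFCMPGE", "compare"),
}


def decode(opcode_bytes, imm8):
    """Map the 0Fh 0Fh opcode plus an imm8 value to a mnemonic and class."""
    if bytes(opcode_bytes) != b"\x0f\x0f":
        raise ValueError("not a 0Fh 0Fh opcode")
    try:
        return SUFFIX_TABLE[imm8]
    except KeyError:
        raise ValueError(f"unrecognized imm8 value {imm8:#04x}") from None
```

The class label is the only information a comparator model needs in order to decide whether its control signals should choose between the operands or between the constants.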



IF (mmreg1[31:0] >= mmreg2/mem64[31:0])
    THEN mmreg1[31:0] = FFFF_FFFFh
    ELSE mmreg1[31:0] = 0000_0000h
IF (mmreg1[63:32] >= mmreg2/mem64[63:32])
    THEN mmreg1[63:32] = FFFF_FFFFh
    ELSE mmreg1[63:32] = 0000_0000h

FIG. 6B 



PFCMPGE
                                          Source 2
                           0                            Normal                         Unsupported
Source 1 &   0             FFFF_FFFFh*                  0000_0000h, FFFF_FFFFh**       Undefined
Destination  Normal        0000_0000h, FFFF_FFFFh***    0000_0000h, FFFF_FFFFh****     Undefined
             Unsupported   Undefined                    Undefined                      Undefined

Notes:

* Positive zero is equal to negative zero.

** The result is FFFF_FFFFh if source 2 is negative, otherwise the result is 0000_0000h.

*** The result is FFFF_FFFFh if source 1 is positive, otherwise the result is 0000_0000h.

**** The result is FFFF_FFFFh if source 1 is positive and source 2 is negative, or if they
are both negative and source 1 is smaller or equal in magnitude than source 2, or if
source 1 and source 2 are both positive and source 1 is greater or equal in magnitude
than source 2. The result is 0000_0000h in all other cases.